
    Using linear predictors to impute allele frequencies from summary or pooled genotype data

    Recently developed genotype imputation methods are a powerful tool for detecting untyped genetic variants that affect disease susceptibility in genetic association studies. However, existing imputation methods require individual-level genotype data, whereas in practice it is often the case that only summary data are available. For example, this may occur because, for reasons of privacy or politics, only summary data are made available to the research community at large; or because only summary data are collected, as in DNA pooling experiments. In this article we introduce a new statistical method that can accurately infer the frequencies of untyped genetic variants in these settings, and indeed substantially improve frequency estimates at typed variants in pooling experiments where observations are noisy. Our approach, which predicts each allele frequency using a linear combination of observed frequencies, is statistically straightforward, and related to a long history of the use of linear methods for estimating missing values (e.g., Kriging). The main statistical novelty is our approach to regularizing the covariance matrix estimates, and the resulting linear predictors, which is based on methods from population genetics. We find that, besides being both fast and flexible---allowing new problems to be tackled that cannot be handled by existing imputation approaches purpose-built for the genetic context---these linear methods are also very accurate. Indeed, imputation accuracy using this approach is similar to that obtained by state-of-the-art imputation methods that use individual-level data, but at a fraction of the computational cost.
    Comment: Published in the Annals of Applied Statistics (http://dx.doi.org/10.1214/10-AOAS338) by the Institute of Mathematical Statistics (http://www.imstat.org/aoas/).
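
    As a rough illustration of the core idea (not the authors' population-genetics regularization), the following R sketch predicts an untyped frequency as the conditional mean under a joint Gaussian model, with a simple shrinkage toward the diagonal standing in for the paper's covariance regularization. The names panel, typed, u, and f_obs are hypothetical inputs for this example.

        ## panel: reference haplotypes (rows) x SNPs (columns), coded 0/1
        ## typed: column indices of SNPs typed in the study
        ## u:     column index of the untyped SNP to impute
        ## f_obs: observed study frequencies at the typed SNPs
        impute_freq <- function(panel, typed, u, f_obs, eps = 0.05) {
          mu <- colMeans(panel)                      # panel allele frequencies
          S  <- cov(panel)                           # sample covariance of haplotypes
          S  <- (1 - eps) * S + eps * diag(diag(S))  # crude stand-in regularization
          ## best linear predictor: E[f_u | f_obs] under a joint Gaussian model
          mu[u] + as.numeric(S[u, typed] %*% solve(S[typed, typed], f_obs - mu[typed]))
        }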

    Empirical Bayes Shrinkage and False Discovery Rate Estimation, Allowing For Unwanted Variation

    We combine two important ideas in the analysis of large-scale genomics experiments (e.g., experiments that aim to identify genes that are differentially expressed between two conditions). The first is the use of Empirical Bayes (EB) methods to handle the large number of potentially sparse effects, and to estimate false discovery rates and related quantities. The second is the use of factor analysis methods to deal with sources of unwanted variation such as batch effects and unmeasured confounders. We describe a simple modular fitting procedure that combines key ideas from both these lines of research. This yields new, powerful EB methods for analyzing genomics experiments that account for both sparse effects and unwanted variation. In realistic simulations, these new methods provide significant gains in power and calibration over competing methods. In real data analysis we find that different methods, while often conceptually similar, can vary widely in their assessments of statistical significance. This highlights the need for care in both choice of methods and interpretation of results. All methods introduced in this paper are implemented in the R package vicar, available at https://github.com/dcgerard/vicar .
    Comment: 42 pages, 11 figures, 3 tables
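
    A minimal R sketch of the modular recipe, under strong simplifications: hidden factors are taken from a truncated SVD of the residual matrix, and the EB step uses a single zero-centered normal prior whose variance is estimated by moment matching (the paper's methods are far more flexible). The names Y, x, and k are hypothetical inputs.

        ## Y: genes x samples expression matrix; x: condition indicator (length n)
        eb_with_factors <- function(Y, x, k = 2) {
          X  <- cbind(1, x)
          B  <- Y %*% X %*% solve(crossprod(X))   # per-gene OLS effects
          R  <- Y - B %*% t(X)                    # residual matrix
          Z  <- svd(R, nu = 0, nv = k)$v          # k hidden factors (samples x k)
          XZ <- cbind(X, Z)                       # design augmented with factors
          H  <- solve(crossprod(XZ))
          B2 <- Y %*% XZ %*% H                    # effects adjusted for factors
          bhat <- B2[, 2]                         # condition effect per gene
          s2   <- rowSums((Y - B2 %*% t(XZ))^2) / (ncol(Y) - ncol(XZ))
          se2  <- s2 * H[2, 2]                    # squared standard errors
          tau2 <- max(mean(bhat^2 - se2), 0)      # EB prior variance estimate
          bhat * tau2 / (tau2 + se2)              # shrunken posterior means
        }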

    Small World MCMC with Tempering: Ergodicity and Spectral Gap

    When sampling a multi-modal distribution $\pi(x)$, $x \in \mathbb{R}^d$, a Markov chain with local proposals is often slowly mixing, while a Small-World sampler (Guan and Krone), a Markov chain that uses a mixture of local and long-range proposals, is fast mixing. However, a Small-World sampler suffers from the curse of dimensionality because its spectral gap depends on the volume of each mode. We present a new sampler that combines tempering, Small-World sampling, and long-range proposals produced from samples in companion chains (as in the Equi-Energy sampler). In its simplest form the sampler employs two Small-World chains: an exploring chain and a sampling chain. The exploring chain samples $\pi_t(x) \propto \pi(x)^{1/t}$, $t \in [1,\infty)$, and builds up an empirical distribution. Using this empirical distribution as its long-range proposal, the sampling chain is designed to have stationary distribution $\pi(x)$. We prove ergodicity of the algorithm and study its convergence rate. We show that the spectral gap of the exploring chain is enlarged by a factor of $t^{d}$ and that of the sampling chain is shrunk by a factor of $t^{-d}$. Importantly, the spectral gap of the exploring chain depends on the "size" of $\pi_t(x)$ while that of the sampling chain does not. Overall, the sampler enlarges a severe bottleneck at the cost of shrinking a mild one, and hence achieves faster mixing. The penalty on the spectral gap of the sampling chain can be significantly alleviated by extending the algorithm to multiple chains whose temperatures $\{t_k\}$ follow a geometric progression. If we allow $t_k \rightarrow 0$, the sampler becomes a global optimizer.
    Comment: 24 pages, 3 figures
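
    To make the two-chain scheme concrete, here is a toy one-dimensional R sketch under simplifying assumptions: a single fixed temperature, a symmetric random-walk local move, and long-range proposals drawn uniformly from the exploring chain's history and accepted with the equi-energy-style ratio $(\pi(y)/\pi(x))^{1-1/t}$. This illustrates the mechanism only, not the algorithm as analyzed in the paper.

        set.seed(1)
        lpi <- function(x) log(0.5 * dnorm(x, -5) + 0.5 * dnorm(x, 5))  # two modes
        t_temp <- 4; n_iter <- 20000; p_long <- 0.1
        xe <- 0; xs <- 0                     # exploring / sampling chain states
        hist_e <- numeric(n_iter)            # exploring-chain history
        out <- numeric(n_iter)
        for (i in 1:n_iter) {
          ## exploring chain: random-walk MH on the tempered target pi^(1/t)
          ye <- xe + rnorm(1, sd = 2)
          if (log(runif(1)) < (lpi(ye) - lpi(xe)) / t_temp) xe <- ye
          hist_e[i] <- xe
          if (runif(1) < p_long) {
            ## long-range proposal from the exploring chain's history;
            ## the acceptance ratio assumes the history approximates pi^(1/t)
            ys <- sample(hist_e[1:i], 1)
            if (log(runif(1)) < (1 - 1 / t_temp) * (lpi(ys) - lpi(xs))) xs <- ys
          } else {
            ## local symmetric random-walk move targeting pi itself
            ys <- xs + rnorm(1, sd = 1)
            if (log(runif(1)) < lpi(ys) - lpi(xs)) xs <- ys
          }
          out[i] <- xs
        }
        ## out should now visit both modes near -5 and +5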

    Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls

    Unwanted variation, including hidden confounding, is a well-known problem in many fields, particularly large-scale gene expression studies. Recent proposals to use control genes --- genes assumed to be unassociated with the covariates of interest --- have led to new methods to deal with this problem. Going by the moniker Removing Unwanted Variation (RUV), there are many versions --- RUV1, RUV2, RUV4, RUVinv, RUVrinv, RUVfun. In this paper, we introduce a general framework, RUV*, that both unites and generalizes these approaches. This unifying framework helps clarify connections between existing methods. In particular we provide conditions under which RUV2 and RUV4 are equivalent. The RUV* framework also preserves an advantage of RUV approaches --- their modularity --- which facilitates the development of novel methods based on existing matrix imputation algorithms. We illustrate this by implementing RUVB, a version of RUV* based on Bayesian factor analysis. In realistic simulations based on real data we found that RUVB is competitive with existing methods in terms of both power and calibration, although we also highlight the challenges of providing consistently reliable calibration across data sets.
    Comment: 34 pages, 6 figures, methods implemented at https://github.com/dcgerard/vicar , results reproducible at https://github.com/dcgerard/ruvb_sim
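
    For intuition, a bare-bones R sketch of the RUV2-style step (one instance of the family unified by RUV*): estimate unwanted factors from the control genes alone, then include them as covariates when estimating each gene's effect. The names Y, x, ctl, and k are hypothetical inputs.

        ## Y: samples x genes matrix; x: covariate of interest (length n)
        ## ctl: logical vector marking negative-control genes; k: # of factors
        ruv2_sketch <- function(Y, x, ctl, k = 2) {
          W <- svd(Y[, ctl], nu = k)$u   # factors estimated from control genes only
          X <- cbind(1, x, W)            # augment the design with the factors
          B <- solve(crossprod(X), crossprod(X, Y))
          B[2, ]                         # estimated effect of x on each gene
        }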

    Bayesian methods for genetic association analysis with heterogeneous subgroups: From meta-analyses to gene-environment interactions

    Genetic association analyses often involve data from multiple potentially heterogeneous subgroups. The expected amount of heterogeneity can vary from modest (e.g., a typical meta-analysis) to large (e.g., a strong gene--environment interaction). However, existing statistical tools are limited in their ability to address such heterogeneity. Indeed, most genetic association meta-analyses use a "fixed effects" analysis, which assumes no heterogeneity. Here we develop and apply Bayesian association methods to address this problem. These methods are easy to apply (in the simplest case, requiring only a point estimate for the genetic effect and its standard error from each subgroup) and effectively include standard frequentist meta-analysis methods, including the usual "fixed effects" analysis, as special cases. We apply these tools to two large genetic association studies: the first a meta-analysis of genome-wide association studies from the Global Lipids consortium, and the second a cross-population analysis of expression quantitative trait loci (eQTLs). In the Global Lipids data we find, perhaps surprisingly, that effects are generally quite homogeneous across studies. In the eQTL study we find that eQTLs are generally shared among different continental groups, and discuss the consequences of this for study design.
    Comment: Published in the Annals of Applied Statistics (http://dx.doi.org/10.1214/13-AOAS695) by the Institute of Mathematical Statistics (http://www.imstat.org/aoas/).
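
    The simplest case mentioned above (a point estimate and standard error per subgroup) can be illustrated with a short R sketch: pool by inverse-variance weighting as in a fixed-effects analysis, then summarize the evidence with a Wakefield-style approximate Bayes factor under a N(0, W) effect prior. This shows the flavor of the computation only; the paper's methods average over models with varying degrees of heterogeneity, and the prior variance W here is an arbitrary illustrative choice.

        ## bhat, se: per-subgroup effect estimates and standard errors
        ## W: prior variance of the effect under the alternative
        fixed_effects_abf <- function(bhat, se, W = 0.1^2) {
          w <- 1 / se^2
          b <- sum(w * bhat) / sum(w)   # pooled ("fixed effects") estimate
          V <- 1 / sum(w)               # variance of the pooled estimate
          ## approximate Bayes factor for effect ~ N(0, W) vs no effect
          sqrt(V / (V + W)) * exp(0.5 * (b^2 / V) * W / (V + W))
        }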

    Efficient Algorithms for Multivariate Linear Mixed Models in Genome-wide Association Studies

    Multivariate linear mixed models (mvLMMs) have been widely used in many areas of genetics, and have attracted considerable recent interest in genome-wide association studies (GWASs). However, fitting mvLMMs is computationally non-trivial, and no existing method is computationally practical for performing the likelihood ratio test (LRT) for mvLMMs in GWAS settings with moderate sample size n. The existing software MTMM performs an approximate LRT for two phenotypes, and, as we find, its p values can substantially understate the significance of associations. Here, we present novel computationally efficient algorithms for fitting mvLMMs and computing the LRT in GWAS settings. After a single initial eigendecomposition (with complexity O(n^3)), the algorithms (i) reduce computational complexity per iteration of the optimizer from cubic to linear in n, and (ii) in GWAS analyses, reduce per-marker complexity from cubic to quadratic in n. These innovations make it practical to compute the LRT for mvLMMs in GWASs with tens of thousands of samples and a moderate number of phenotypes (~2-10). With simulations, we show that the LRT provides correct control of type I error. With both simulations and real data we find that the LRT is more powerful than the approximate LRT from MTMM, and we illustrate the benefits of analyzing more than two phenotypes. The method is implemented in the GEMMA software package, freely available at http://stephenslab.uchicago.edu/software.htm
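
    A univariate R sketch of the rotation that underlies such speedups (the paper's contribution is the harder multivariate case): after a one-time O(n^3) eigendecomposition of the relatedness matrix, the covariance becomes diagonal in the rotated coordinates, so evaluating the likelihood at any variance ratio lambda costs O(n). The objects y (phenotype), X (design including the candidate marker), and K (kinship matrix) are assumed to be in scope.

        ed <- eigen(K, symmetric = TRUE)   # one-time O(n^3) step
        yt <- crossprod(ed$vectors, y)     # rotated phenotype
        Xt <- crossprod(ed$vectors, X)     # rotated covariates
        neg_loglik <- function(lambda) {
          v <- lambda * ed$values + 1      # diagonal covariance (up to sigma_e^2)
          b <- solve(crossprod(Xt / v, Xt), crossprod(Xt / v, yt))  # GLS fit
          r <- yt - Xt %*% b
          n <- length(yt)
          0.5 * (sum(log(v)) + n * log(sum(r^2 / v)))  # sigma_e^2 profiled out
        }
        opt <- optimize(neg_loglik, c(1e-5, 1e5))      # O(n) per evaluation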

    Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease

    Many common diseases are highly polygenic, modulated by a large number of genetic factors with small effects on susceptibility to disease. These small effects are difficult to map reliably in genetic association studies. To address this problem, researchers have developed methods that aggregate information over sets of related genes, such as biological pathways, to identify gene sets that are enriched for genetic variants associated with disease. However, these methods fail to answer a key question: which genes and genetic variants are associated with disease risk? We develop a method based on sparse multiple regression that simultaneously identifies enriched pathways and prioritizes the variants within these pathways, to locate additional variants associated with disease susceptibility. A central feature of our approach is an estimate of the strength of enrichment, which yields a coherent way to prioritize variants in enriched pathways. We illustrate the benefits of our approach in a genome-wide association study of Crohn's disease with ~440,000 genetic variants genotyped for ~4,700 study subjects. We obtain strong support for enrichment of IL-12, IL-23 and other cytokine signaling pathways. Furthermore, prioritizing variants in these enriched pathways yields support for additional disease-associated variants, all of which have been independently reported in other case-control studies of Crohn's disease.
    Comment: Submitted to PLoS Genetics
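
    The prioritization step can be illustrated with a small R sketch: the prior log-odds of inclusion rise by theta for variants in an enriched pathway, and each variant's Bayes factor is reweighted by these pathway-informed prior odds. In the paper the enrichment parameter is estimated jointly within the sparse regression; here it is fixed, and all names and values are hypothetical.

        ## log10bf: per-variant log10 Bayes factors
        ## a: 1 if the variant lies in the enriched pathway, 0 otherwise
        prioritize <- function(log10bf, a, theta0 = -4, theta = 2) {
          prior_logodds <- (theta0 + theta * a) * log(10)
          plogis(log10bf * log(10) + prior_logodds)   # posterior inclusion prob.
        }
        ## e.g. a variant with log10 BF = 3 has posterior probability ~0.09
        ## outside the enriched pathway, but ~0.91 inside it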

    Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays

    Understanding how genetic variants influence cellular-level processes is an important step towards understanding how they influence important organismal-level traits, or "phenotypes", including human disease susceptibility. To this end, scientists are undertaking large-scale genetic association studies that aim to identify genetic variants associated with molecular and cellular phenotypes, such as gene expression, transcription factor binding, or chromatin accessibility. These studies use high-throughput sequencing assays (e.g., RNA-seq, ChIP-seq, DNase-seq) to obtain high-resolution data on how the traits vary along the genome in each sample. However, typical association analyses fail to exploit these high-resolution measurements, instead aggregating the data at coarser resolutions, such as genes or windows of fixed length. Here we develop and apply statistical methods that better exploit the high-resolution data. The key idea is to treat the sequence data as measuring an underlying "function" that varies along the genome, and then, building on wavelet-based methods for functional data analysis, to test for association between genetic variants and the underlying function. Applying these methods to identify genetic variants associated with chromatin accessibility (dsQTLs), we find that they identify substantially more associations than a simpler window-based analysis, and in total we identify 772 novel dsQTLs not identified by the original analysis.
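
    A toy R sketch conveys the key idea, under simplifications: each individual's coverage profile (length a power of two) is Haar-transformed, association with genotype is tested coefficient by coefficient, and the evidence is combined naively by Fisher's method, whereas the paper uses a Bayesian hierarchical model over wavelet coefficients. The function names and inputs are hypothetical.

        haar <- function(y) {                  # Haar DWT; length(y) must be 2^J
          out <- numeric(0)
          while (length(y) > 1) {
            odd <- seq(1, length(y), by = 2)
            out <- c(out, (y[odd] - y[odd + 1]) / sqrt(2))  # detail coefficients
            y   <- (y[odd] + y[odd + 1]) / sqrt(2)          # smooth coefficients
          }
          c(out, y)
        }
        region_test <- function(profiles, g) { # profiles: individuals x sites
          W <- t(apply(profiles, 1, haar))     # wavelet coefficients per individual
          p <- apply(W, 2, function(w) summary(lm(w ~ g))$coefficients[2, 4])
          stat <- -2 * sum(log(p))             # naive Fisher combination
          pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
        }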

    varbvs: Fast Variable Selection for Large-scale Regression

    We introduce varbvs, a suite of functions written in R and MATLAB for regression analysis of large-scale data sets using Bayesian variable selection methods. We have developed numerical optimization algorithms based on variational approximation methods that make it feasible to apply Bayesian variable selection to very large data sets. With a focus on examples from genome-wide association studies, we demonstrate that varbvs scales well to data sets with hundreds of thousands of variables and thousands of samples, and has features that facilitate rapid data analyses. Moreover, varbvs allows for extensive model customization, which can be used to incorporate external information into the analysis. We expect that the combination of an easy-to-use interface and robust, scalable algorithms for posterior computation will encourage broader use of Bayesian variable selection in areas of applied statistics and computational biology. The most recent R and MATLAB source code is available for download at GitHub (https://github.com/pcarbo/varbvs), and the R package can be installed from CRAN (https://cran.r-project.org/package=varbvs).
    Comment: 31 pages, 6 figures
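
    For a sense of the machinery, here is a stripped-down R sketch of the kind of variational coordinate-ascent update that this class of methods builds on, with the hyperparameters (residual variance sigma2, prior effect variance sa2, prior inclusion probability p1) held fixed; the package itself fits or averages over them. This is a sketch of the general technique, not the varbvs source code.

        varbvs_sketch <- function(X, y, sigma2 = 1, sa2 = 1, p1 = 0.05, n_iter = 100) {
          p  <- ncol(X); xx <- colSums(X^2)
          alpha <- rep(p1, p); mu <- rep(0, p)  # variational parameters
          Xr <- X %*% (alpha * mu)              # current fitted values
          s2 <- sigma2 / (xx + 1 / sa2)         # posterior variances if included
          for (it in 1:n_iter) {
            for (j in 1:p) {
              Xr <- Xr - X[, j] * (alpha[j] * mu[j])  # remove j's contribution
              mu[j] <- (s2[j] / sigma2) * sum(X[, j] * (y - Xr))
              logodds <- log(p1 / (1 - p1)) +
                0.5 * log(s2[j] / (sa2 * sigma2)) + mu[j]^2 / (2 * s2[j])
              alpha[j] <- plogis(logodds)             # posterior inclusion prob.
              Xr <- Xr + X[, j] * (alpha[j] * mu[j])  # restore with new values
            }
          }
          list(pip = alpha, beta = alpha * mu)
        }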

    Flexible signal denoising via flexible empirical Bayes shrinkage

    Signal denoising---also known as non-parametric regression---is often performed through shrinkage estimation in a transformed (e.g., wavelet) domain; shrinkage in the transformed domain corresponds to smoothing in the original domain. A key question in such applications is how much to shrink, or, equivalently, how much to smooth. Empirical Bayes shrinkage methods provide an attractive solution to this problem; they use the data to estimate a distribution of underlying "effects", and hence automatically select an appropriate amount of shrinkage. However, most existing implementations of Empirical Bayes shrinkage are less flexible than they could be---both in their assumptions about the underlying distribution of effects, and in their ability to handle heteroskedasticity---which limits their usefulness in signal denoising applications. Here we address this by taking a particularly flexible, stable and computationally convenient Empirical Bayes shrinkage method and applying it to several signal denoising problems, including smoothing of Poisson data and heteroskedastic Gaussian data. We show through empirical comparisons that the results are competitive with those of other methods, including both simple thresholding rules and purpose-built Empirical Bayes procedures. Our methods are implemented in the R package smashr, "SMoothing by Adaptive SHrinkage in R," available at https://www.github.com/stephenslab/smash
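
    The core shrinkage step can be sketched in a few lines of R for the heteroskedastic Gaussian case, using a single zero-centered normal prior fit by marginal maximum likelihood; the method used in the paper fits a much more flexible unimodal prior, so this is an illustration of the principle only.

        ## bhat[j] ~ N(theta[j], se[j]^2), with se known and varying across j
        eb_shrink <- function(bhat, se) {
          nll <- function(log_tau2) {          # marginal neg. log-likelihood
            v <- exp(log_tau2) + se^2
            0.5 * sum(log(v) + bhat^2 / v)
          }
          tau2 <- exp(optimize(nll, c(-20, 10))$minimum)
          bhat * tau2 / (tau2 + se^2)          # posterior mean estimates
        }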